CONVERGENCE OF SEQUENTIAL MARKOV CHAIN MONTE 
CARLO METHODS: I. NONLINEAR FLOW OF PROBABILITY 

MEASURES 



ANDREAS EBERLE AND CARLO MARINELLI 

Abstract. Sequential Monte Carlo Samplers are a class of stochastic algorithms for 
Monte Carlo integral estimation w.r.t. probability distributions, which combine ele- 
ments of Markov chain Monte Carlo methods and importance sampling/resampling 
schemes. We develop a stability analysis by functional inequalities for a nonlinear flow 
of probability measures describing the limit behavior of the algorithms as the number 
of particles tends to inflnity. Stability results are derived both under global and local 
assumptions on the generator of the underlying Metropolis dynamics. This allows us 
to prove that the combined methods sometimes have good asymptotic stability prop- 
erties in multimodal setups where traditional MCMC methods mix extremely slowly. 
For example, this holds for the mean field Ising model at all temperatures. 

Spectral gap estimates, or, equivalently, Poincare inequalities, as well as other related 
functional inequalities provide powerful tools for the study of convergence to equilibrium 
of reversible time-homogeneous Markov processes (see e.g. [9], [10], [11]). In particular, 
they have been successfully applied to analyze convergence properties of Markov Chain 
Monte Carlo (MCMC) methods based on reversible Markov chains (see e.g. [12]). The 
idea of MCMC methods is to produce approximate samples from a probability distri- 
bution /i by simulating for a sufficiently long time an ergodic Markov chain having /i 
as invariant measure. MCMC methods have become the standard to carry out Monte 
Carlo integrations with respect to complex probability distributions in many fields of 
applications, including in particular Bayesian statistics, statistical physics, and compu- 
tational chemistry. We just refer the interested reader to [15] and [19] and references 
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therein as an example of work in this the hteraturc is by now enormous. Since 

the Markov chain is usuaUy started with an initial distribution that is very different from 
/J., strong convergence properties, such as exponential convergence to equilibrium with 
a sufficiently large rate, are required to ensure that the corresponding MCMC method 
produces sufficiently good approximate samples from fj,. However, these strong conver- 
gence properties often do not hold in multimodal, and in particular high-dimensional 
problems, as they arise in many applications. For example, in statistical mechanics 
models with phase transitions, the rate of convergence often decays exponentially in the 
system size within the multi-phase regime. 

In this and a follow-up article we initiate a study of convergence properties by func- 
tional inequalities for a class of algorithms for Monte Carlo integral estimation that are 
a combination of sequential Monte Carlo and MCMC methods. Instead of producing 
constantly improved samples of a fixed distribution fj,, these sequential MCMC methods 
try to keep track as precisely as possible of an evolving sequence {lJ-t)o<t<i3 of probability 
distributions. Here /xq is an initial distribution that is easy to simulate, and fj,^ is the 
target distribution that one would like to simulate. Importance sampling and resam- 
pling steps are included to constantly adjust for the change of the underlying measure. 
Whereas for MCMC methods exponential asymptotic stability is usually required to ob- 
tain improved samples, the sequential MCMC method starts with a good estimate of /iq, 
and one only has to control the growth of the "size" of the error. As a consequence, the 
method sometimes works surprisingly well in multimodal situations where traditional 
MCMC methods fail, cf. also the examples below. The price one has to pay is that 
samples from ^u^ cannot be produced individually. Instead, the corresponding algorithm 
produces directly a Monte Carlo estimator for /i^ given by the empirical distribution 
of a system of interacting particles at the final time. To ensure good approximation 
properties, a large number N of particles is required. 

Variants of such sequential MCMC methods have recently been proposed at several 
places in the statistics literature, see in particular [6] and references therein, as well as [3] . 
However, precise and general mathematical methods for the convergence and stability 
analysis, in the spirit of those developed for traditional MCMC methods by Diaconis, 

Saloff-Coste, Jcrrum, Sinclair, and many others, seem still to be missing - although 
very important first steps can be found in the work of Del Moral and coauthors, cf. 
e.g. [5], [7] and [8]. The classical approach via Dobrushin contraction coefficients is 
usually limited to very regular situations. Moreover, it rarely yields precise statements 
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on the convergence properties, and it can not be combined easily with decomposition 
techniques. 

Our aim is to make variants of the powerful techniques of the spectral gap/Dirichlet 
form approach to convergence rates of time-homogeneous Markov chains (e.g. canoni- 
cal paths, comparison and decomposition results) available in the different context of 
sequential MCMC methods. Mathematically, this means at first to study a class of 
nonlinear evolutions of probability measures by functional inequalities. Such a study 
has been initiated in a related context by Stannat [22]. In this work, wc restrict our- 
selves to the simplest and most natural variant of sequential MCMC, where importance 
sampling/resampling is only used to adjust constantly for the change of the underlying 
distribution, and MCMC steps at time t arc always carried out such that detailed bal- 
ance holds w.r.t. the measure nt (and not w.r.t. (IqI). This seems crucial for establishing 
good stability properties. Note that the type of sequential Monte Carlo samplers studied 
here is different from those analyzed by Del Moral and Doucet in [5]. An algorithmic 
realization has been applied to simulations in Bayesian mixture models by Del Moral, 
Doucet and Jasra in [6], who observed substantial benefits compared to other methods. 

We have divided our work on sequential MCMC methods into two publications: in this 
first article we study the stability properties of nonlinear flows of probability measures 
describing the limit as the number N of particles goes to infinity. In the follow-up 
work [13] we will apply the results to control the asymptotic variances of the Monte 
Carlo estimators as N tends to infinity. The functional inequality approach enables us 
to prove stability properties not only under global but also under local conditions, i.e. 
assuming only that good estimates hold on each set of a decomposition of the state 
space. As a consequence, we obtain a procedure for analyzing the asymptotic behavior 
of sequential MCMC methods applied to multimodal distributions. For example, in 
the spirit of previous results for tempering algorithms by Madras and Zheng [18] and 
others, we can prove good (polynomial in the system size and the inverse temperature) 
stability properties in the case of the mean field Ising model, cf. Section 2.5 below. 
We also demonstrate in a simple exponential model with several modes that the flow of 
probability measures corresponding to sequential MCMC methods has better stability 
properties than the one corresponding to the classical simulated annealing algorithm, 
cf. 2.4. Sequential MCMC methods might hence also provide an efficient alternative to 
simulated annealing. 
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1. Setup 

1.1. Sequential estimation of probability measures. Let S denote a finite state 
space, and fj, a probability measure on S with full support, i.e. n{x) > for all x E S. 
The finiteness of the state space is only assumed to keep the presentation as simple and 
non-technical as possible. Most results of this paper extend to continuous state spaces 
under standard regularity assumptions. By Aii{S) we denote the space of probability 
measures on S. As usual, 

u{f) := j fdu = 

•'^ xes 

denotes the expectation of a function / : S — R w.r.t. a measure u G Mi{S). We con- 
sider methods for Monte Carlo integration with respect to the probability distributions 
of an exponential family 

fitix) = ^ e-*^(^V(^), < t < GO, (1) 

where : 5 ^ [0, oo) is a given function, and Zt := Ylx&s e~^^^^'> l-i-ix) is a normal- 
ization constant. Below, t will play the role of a time parameter for a particle system 
approximation. 

Note that for a fixed /? > 0, any given probability measure u on S that is mutually 
absolutely continuous with respect to n can be written in the form (1) with t = /? by 
setting H{x) = ^ log One should then think of the family {fJ-t)Q<t<i3 of probability 
measures as a particular way to interpolate between the target distribution fip that we 
would like to simulate, and the reference distribution = ji that can be simulated 
more easily. Although we restrict our attention here to this simple way of interpolating 
between two measures, other interpolations can be treated by similar methods. In fact, 
an arbitrary family (Mt)o<t<^ of mutually absolutely continuous probability measures 
on S with smooth dependence on t can be written in the form 

M^) = ^e-Zo^^^^^'^Xx), 0<i</3, (2) 
Zt 

where Zt is a normalization constant, and (s,.t) ^ Us{x) is a continuous non-negative 
function on [0, (3] x S. Our results below extend to this more general case, cf. Section 
2.6 below. 
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The main advantage of the interpolation (1) is that the singularity of nt w.r.t. /x is 
resolved uniformly over time. In particular, 

< osc{H)-\s-t\ for all s, t > and a; G 5 (3) 

where osc{H) := maxg H — ming H. On the other hand, other interpolations, e.g. by a 
spatial coarse graining, may be preferable in concrete applications. 

One way to obtain sequential methods for Monte Carlo estimation of expectation 
values with respect to the measures fi^ is to proceed as follows: 

a) Construct a semigroup {^s,t)o<s<t<oo of nonlinear transformations on the space 
of probability measures on S, such that 

= for all < s < t . (4) 

b) Spatial discretization by interacting particle system: Construct an appropriate 
Markov process {Xj, X^^) on {N € N) related to the nonlinear semigroup 

and estimate fXt = $o,tA* by the empirical distributions 

1 ^ 

of the process with initial distribution fi^ . 

c) Time-discretization: Approximate the continuous time Markov process {X^, . . . , X^) 
by a time-discrete Markov chain on S*^ (which can then be simulated). 

1.2. The non- linear semigroup. To define the nonlinear semigroup ^s,t and the par- 
ticle system we have in mind, we consider the generators (Q-matrices) jCt at time t >0 
of a time-inhomogeneous Markov chain on S satisfying the detailed balance condition 

Ht{x)Lt{x, y) = nt{y)Ct{y, x) y t>0, x,y e S. (5) 

The generators Ct determine the MCMC steps in a corresponding sequential MCMC 
method. We assume that Ct{x,y) depends continuously on t. To compare algorithmic 
performance in a reasonable way, one might also assume 

J2^tix,y)<l yxeS, (6) 

although this is not necessary for the results below. For example, Ct could be the 
generator of a Metropolis dynamics w.r.t. /xj, i.e., 

^t{x,y) = -min ( ) iovx^y, 

\Mx) J 



log 



Mx) 
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Ct{x,x) = — ^y^x^ti^-'V)^ where the proposal matrix Kt is a given symmetric tran- 
sition matrix on S. By (5), Ct defines a symmetric linear operator on L'^{S,fj,t). The 
associated Dirichlet form on functions f,g : S ^M. is 

£t{f,g) := -EtifCtg] = \ ^ (/(y) - - <7(x)) y) , 

x,y&S 

where Ej stands for expectation w.r.t. /xj, and 

{Ctg){x) := ^Ct{x,y)g{y). 
y 

We shall often use the abbreviated notation £t{f) ■= £t{f, /)■ 

We fix non-negative constants Mt {t > 0) that determine the average relative fre- 
quency of MCMC moves compared to importance sampling/resampling steps in a cor- 
responding SMCMC method. Again, we assume that 1 1— >■ is continuous. 

Let Ps,t{x, y) and qs,t{x, y) {x,y E S) be the unique solutions of the forward equations 

Ps,t{MtCtf-Hf), Ps,sf = f, (7) 
qs,t{MtCtf - Htf), qs,sf = f, (8) 

Ht := H-Et[H] . 
The linear semigroups Ps,t and qs,t have the Feynman-Kac representations 

Ps,tf{x) = y"e-/.*^(^'-H)*/(Xt(a;))Pt,,(da;), 

where {Xtj'Pt^x) is a time-inhomogeneous Markov process with generator Mt ■ jCf In 
particular one has 

qs,tf = ^r[H] drj ps,tf. 

We consider the nonlinear semigroup 

{ups,t){S) Hs,t){S) 
on the space Aii{S) of probability measures on 5*. Here 

T^p{y) = X] ^i^)pi^^ y) ■ 

xes 



g-^Ps,tf 



where 
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The semigroup ^s,t describes the time evohition of the law of an inhomogeneous Markov 
chain with generator Mt ■ Ct and absorption rate H, conditioned to be alive at time t 
(see e.g. [2]). It is not difficult to verify that (4) holds, cf. Theorem 1 below. 

1.3. Particle system approximations. To approximate ^s,t one could use a particle 
system consisting of independent Markov chains with absorption, and base the Monte 
Carlo estimation on the particles that are still alive at time t. However, such a proce- 
dure would be usually very inefficient, since in most interesting cases the overwhelming 
majority of particles would have become extinct already at the final time. Instead, se- 
quential Monte Carlo samplers are based on a time-inhomogeneous Markov chain on 
S'^ , N eN, with a generator that is for example of type 

N 

Cff{x\...,x^) = Mt.Y,{C?f){x\...,x^) 

1=1 

1 ^ 

Here C^^^ denotes the application of Ct to the z-th component, and r*'^(x) := y where 
:= and y^ := x^ for all k ^ i. Hence, between the interactions the particles move 
according to time-inhomogeneous Markov chains with generator • Ct and absorption 
rate H, and in case of absorption, the position is replaced by the position of a randomly 
chosen particle. Other interaction terms that correspond to different resampling schemes 
are possible as well. The asymptotics as A/" — oo of the approximating particle systems 
with mean field interaction has been studied intensively, cf. e.g. the monograph [4] . 

1.4. Convergence and stability properties. The quality of Monte Carlo estimates 
of Utif) = J f djit for some function / : S" — > R can be measured by the bias and 
the (asymptotic) variance of the corresponding estimators. The theoretical analysis of 
the sequential MCMC methods considered here can be subdivided into several steps as 
above: 

a) Stability properties of the semigroup <^s,t- 

b) Bias and asymptotic variance of the estimators fif {f) = jf J2iLi fi^t)- 

c) Effect of the discretization in time. 

In this paper, we will focus exclusively on the first step, that is we develop a stability 
analysis for ^s,t based on functional inequalities. A follow-up paper [13] will be devoted 
to the time dependence of the asymptotic (as N — oo) mean square error of the particle 
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system based estimators jlf{f). Let us remark for the moment, that significant work 
in this direction has already been done, e.g., by Del Moral and Miclo in [8]. The results 
clearly indicate that techniques very close to those developed here can also be applied 
to control the asymptotic variances of the approximating particle systems. This will be 
made precise in [13]. 

We also point out that usually the time discretization is carried out before the spatial 
discretization, i.e. one usually directly considers semigroups and particle systems in 
discrete time. Even though this is closer to the algorithmic realization, the convergence 
analysis becomes more transparent in continuous time due to the infinitesimal descrip- 
tion (at least from an analytic perspective). Moreover, the continuous time setup allows 
us to see more clearly how frequently different types of moves of the particle systems 
should be carried out. 

Before stating our results, we comment on relations of sequential MCMC methods to 
several standard methods for Monte Carlo integration: 

- Parallel MCMC is a special case of the algorithm above when H = 0, i.e. /J-t = fJ- 
for all t. In this case the associated particle system consists of independent time- 
homogeneous Markov chains with invariant measure fi. Common problems are slow 
mixing due to multimodality and the burn-in time (i.e. the time needed to reach equi- 
librium from an initial distribution that is far from /x can be much larger than the 
inverse spectral gap). Both problems are particularly significant in high dimensional 
setups ("curse of dimension"). 

- In parallel simulated annealing, the approximating particle system is given by inde- 
pendent time-inhomogeneous Markov chains with generator Ct- There are no interac- 
tions. In this case, the corresponding (linear) semigroup on probability measures does 
not satisfy (4). As a consequence, there is an asymptotic bias of the corresponding Monte 
Carlo estimator, which can only be reduced by the mixing properties of the underlying 
Markov chains. Therefore, in multimodal setups good convergence properties can only 
be guaranteed if the measures fj^ change very slowly (logarithmic cooling schedule). 

- Pure importance sampling/resampling is the special case of our method when = 
for all t. Since the particles cannot explore the state space, it is only applicable for 
small state spaces, or in very special situations. In fact, the results below indicate that 
a certain amount of particle motion is needed to ensure good stability properties. Our 
results below can be used to quantify, at least in principle, how many MCMC moves are 
needed to balance the error growth due to importance sampling/resampling. 
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— A combination of importance sampling and MCMC (without resampling) is similar 
to considering Markov chains with absorption, conditioned to stay alive. This is often 
inefficient, cf. the remark above. 

- Finally, we would like to point out that the analysis of several multilevel sampling 
methods (see e.g. [15]) such as umbrella sampling (cf. [16]), simulated and parallel tem- 
pering (cf. [18], [1], [21]) has been an inspiration for this work. These MCMC methods 
provide samples from mixtures, direct sums, or products of distributions /xt- {0 < i < m), 
= to < ti < ■ ■ ■ < tm, m e N. A disadvantage of umbrella sampling and simulated 
tempering is that the normalization constants have to be estimated in parallel. Parallel 
tempering avoids this disadvantage by simulating the product distribution q t^U with 
the help of a Metropolis chain on S"^~^^ where neighboring coordinates can be swapped. 
However, the swapping procedure seems to slow down the convergence in some cases, 
and it makes the convergence analysis rather intricate, cf. [18]. The sequential MCMC 
methods presented here can be seen as an attempt to overcome these difficulties. The 
estimation of the normalization constant is built into the algorithm, and the evolution 
in t is linear - and thus faster than a diffusive motion in f as in simulated and parallel 
tempering. Once the basic techniques are developed, the convergence analysis seems 
also to be at least partially more transparent for sequential MCMC than for simulated 
and parallel tempering. 

2. Main results 

2.1. Time evolution of the mean square error. Let ut := $o,ti^ for some given 
initial distribution u G Mi{S), and let 

<,.(,) := t> 0, 

my) 

denote the relative density of ut w.r.t. the measure nt defined by (1). Moreover, let 

et :=Et[{gt-lf] 

denote the mean square error (x^-contrast) of Uf w.r.t. nt- Our first result shows that 
Ht = $o,tA*0) and it gives a general method to analyze the stability of this evolution in 
an sense: 

Theorem 1. (i) ut = ^o,w is the unique solution of the nonlinear evolution equation 

^i/t = MtVtCt - Hvt + Vt{H)uu t > 0. (9) 
with initial condition 1/0 = 1/. 
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(a) The densities gt solve 

^^gt = MtCtgt + MH{gt - l)]gt. (10) 
(Hi) The time evolution of the mean square error is given by 

~et = -MtStigt - 1) - ^MHtigt - lf]+MHt{gt - l)]et. (11) 

Remark 2. (9) is the forward equation for the nonhnear semigroup ^s,t (for s = 0). 
The corresponding assertion holds for ut := ^s,t^ for t > s > 0. Since fit solves (9), we 
obtain in particular 

fit = ^s,tfJ's for alH > s > . 

The proof of the theorem is given in Section 3 below. Similar equations have been 
derived in a more general setup by Stannat [22]. 

The main objective of this article is to develop efficient tools to bound the growth 
of Et based on Theorem 1. To estimate the right-hand side of (11) we have to control 
the two terms involving Ht (which correspond to importance sampling/resampling) by 
the Dirichlet form £t (which corresponds to MCMC moves). We first discuss how this 
can be achieved in the presence of a good global spectral gap estimate. Afterwards, we 
give results based on local Poincarc-typc inequalities, which can sometimes be used to 
control the error growth in multimodal setups where good global mixing properties of 
the underlying Markov chains do not hold. 

2.2. Stability based on global estimates. For t > let 

Ct := sup {Et[f]/Stif,f)\ f:S^Rs.t.Et[f]=0, f^O} 

denote the (possibly infinite) inverse spectral gap of Ct, and let 

At := sup {Et[Hff]/£tif, /) | / : 5 ^ M s.t. Et[f] = , / ^ } . 

Thus Ct and At are the optimal constants in the global Poincare inequalities 

Vart(/) < CfStifJ) V/:5^M, and (12) 
Et[Hf{f-Et[f]f] < AfStifJ) V/:5^M. (13) 

Here Var^ denotes the variance w.r.t. fit- 
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Remark 3. (i) There exist efficient techniques to obtain upper bounds for Ct, for 
example the method of canonical paths, comparison methods (see e.g. [20]), as well as 
decomposition methods (see e.g. [14]). Variants of these techniques can be applied to 
estimate At as well, 
(ii) Clearly, one has 

At < Ct ■ sup H^{x), (14) 

x&S 

SO an upper bound on Ct yields a trivial (and usually far from optimal) upper bound on 
At. 

Let 

atiH) := Vart(i7)V2 = EtiH^]'/^ 

denote the standard deviation of H w.r.t. Ht- The next result bounds the error growth 
in terms of Ct and At : 

Theorem 4. // > At/2 for all t > 0, then 

d 2Mt~At 1/2 

— \oget < — + 2at{H)et' (15) 

and 

i. log,, < + 2(^E,K-l)"%-'^ (16) 

The proof will be given in Section 3 below. Inequality (15) is straightforward to prove, 

but sometimes (16) is stronger, since the constants only depend on the negative part of 
Hf. As an immediate consequence of the theorem we obtain estimates on the average 
relative frequency Mt of MCMC moves that is sufficient to guarantee stability of the 
corresponding nonlinear flow of probability measures: 

Corollary 5. Let < Po < Pi, and assume that for all t G (/3o,/3i), 

Mt > ^ + Ctat{H)sX' (17) 

or 

Mt > ^ + {AtCtEtlHr])'/' 4f + I CtEtlHf] s^, . (18) 
Then t^ £t is strictly decreasing on the interval [/3o,/3i]- 
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Remark 6. (i) On the finite state spaces considered here, the constants Cj and At are 
finite if Ct is irreducible. However, in multimodal situations, the numerical values of 
these constants are often extremely large. Alternative estimates based on local Poincare- 
type inequalities are given below. 

(ii) Similarly to the corollary, one obtains that the error decays exponentially with 
rate 7 > 0, i.e. 1 1— e^^ et is decreasing on [/3o,/3i], provided 

Mt > +C,a,(i^)e-^(*-^o)/2 4f G (/3o,/3i), (19) 

or a similar condition replacing (18) holds. 

(iii) One can often assume that the initial error e^p is very small. In this case, Mt 
slightly greater than {At + jCt)/2 is enough to ensure exponential decay with rate 7. 

(iv) The case H = corresponds to classical MCMC. Here At = for all t, so 

det/dt < —2^£t. This yields the classical exponential decay with rate 27 of the mean 
square error in the presence of the global spectral gap Mt/Ct > 7 of the generator 
Mt ■ Ct- For H ^ 0, additional MCMC moves are required to make up for the error 
growth due to importance sampling/resampling. 

Roughly, the corollary says that is the initial error is sufficiently small, the stabiliz- 
ing effects of the MCMC dynamics make up for the error growth due to importance 
sampling/resampling provided Mt > At/ 2. 

Comparison with parallel MCMC. Suppose that we want to simulate jj^js for a fixed /? > 0. 
Parallel MCMC consists in simulating N independent time homogeneous Markov chains 
with generator Cp. This algorithm is clearly a special case of the sequential MCMC 
procedure introduced above, where fit = up for alH > and H = 0. If the chains are 
run with initial distribution /iq, one has 

£t < e-^'/^i^eo < e-2*/t^/5 • (e^^^^^W - 1) 

where we have used that 

^.^^(si-r-w^E^Mx)-:..---. 

Hence to ensure et < £ for a given e > and T > 0, a total running time 

T > ^. (^/3osc(iJ) + log0 

is sufficient. If (6) holds, the number of MCMC steps required for a simulation is of 
the same order as T. Alternatively, we can apply the sequential MCMC method with 
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varying distributions (0 < t < /3) . Using the rough estimate At < Cf ■ sup and 
(18), we see that £t decreases in time if 

Mt > ^Ct sup (1 + 4/')2 V i G (0, 

Thus an expected total number of MCMC steps of order 

^{l + el^y J^^Ct sup Hfdt 

suffices to guarantee stabihty of the corresponding nonhnear semigroup. 

More drastic improvements due to sequential MCMC appear when good global spec- 
tral gap estimates do not hold, as we shall now demonstrate. 

2.3. Error control based on local estimates. Madras and Randall [17] and Jerrum, 
Son, Tetali and Vigoda [14] have shown how to derive estimates for spectral gaps and 
logarithmic Sobolev constants of the generator of a Markov chain from corresponding 

local estimates on the sets of a decomposition of the state space combined with estimates 
for the projected chain. This has been applied to tempering algorithms in [18], [1] and 
[21]. We now develop related decomposition techniques for sequential MCMC. However, 
in this case, we will assume only local estimates for the generators Ct, and no mixing 
properties for the projections - whence there will be an unavoidable error growth due 
to importance sampling/resampling between the components. The results and exam- 
ples below indicate that nevertheless sequential MCMC methods might potentially be 
at least equally efficient as tempering algorithms in many applications. Since mixing 
properties for the projections do not have to be taken into account, the analysis of the 
decomposition simplifies considerably. 

Let < /3o < /3i < oo. We assume that for every t G (/3o,/?i), there exists a 
decomposition 

s = [jsi 

into finitely many disjoint sets with Ht{Sl) > 0, as well as non-negative definite quadratic 
forms £1 {i G /) on functions on S such that 

J2 lH{Si) Siif, f) < K- Stif, /) V t G (/3o, Pi), f:S^R (20) 
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for some fixed finite constant K. For example, one might choose f| as the Dirichlet form 
of the Markov chain corresponding to Ct restricted to SI, i.e., 

StifJ) = I H {f{y)-f{x)fjrt{x,y)nt{x\Si). (21) 

x,yeSl 

In this case, (20) holds with K = 1. 

Let us denote by Ej and Var^, respectively, the expectation and variance w.r.t. the 
conditional measure 

^^l{A) := fit{A\Sl), 
and by vr : 5 ^ / the natural projection. In particular, 

E,[/|7r] = J2^i[f]-Xsv 

ies 

for any function f : S —>-R. We set 

Ht := H-Et[H\TT]. 

Assume that the following local Poincare type inequalities hold for all t G {(3o,Pi) 
and iei with constants Al, Bl G (0, cxd) : 

n[-Ht f] < A ■ /) V/ : 5 ^ M : Ei[/|7r] = , (22) 
\n\Htf\f < Bl-SiifJ) V/:5^M : Ei[/|7r]=0. (23) 



Remark 7. (i) Note that to verify (22) it is enough to estimate Kl[Hff'^], while for 
(23) one has to take into account the positive part of Ht as well. In particular, (22) 
can not be used to derive an estimate of type (23). However, if (22) holds with —Ht 
replaced by \Ht\, then (23) holds with Bf = Ei[\Ht\] ■ A^. 
(ii) If local Poincare inequalities of the type 

VarK/) < CI ■ Slif, /) V / : 5 ^ M, i G /, (24) 

hold, then (22) and (23) hold with = Cj ■ maxg, H^ and Bj = Cf ■ Varj(i?). 

Combining (20) and (22), (23) respectively yields 

M-Htfl] = Y.l^t{Sl)n[-Htif] < AfStifJ) yf:S-^R, (25) 
iei 

and 

J2lH{Si)\Ei[Htft]\^ < BfStifJ) V/:5^M, (26) 
iei 
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where 

At := K ■ maxAl and Bf := K ■ max Bl . 

i i 

The following error estimate is our key result : 
Theorem 8. If Mt > At/2 for all t e {Po,Pi) then 

^ ^\ /o • (1 + + (1 + V^)' ■ "l^ V(0 (27) 
at Mt-At/2 »e-f 

htd) := Ei[H] - Et[H] = - ^logM.(5|) 



{iel). (28) 

s=t 



The proof is given in Section 3 below. To understand the consequences of (27), let us 
first consider the asymptotics as Mt tends to infinity. In this case, (27) reduces to 

^ loge* < (1 + V^f ■ maxht . 

In order to ensure that for t > (3o the error St remains below a given threshold S > 0, 
note that as long as St < S, we have 

■ max ht . 

Thus 

mm{st, 6) < ep, ■ V t G [/3o, Pi] , (29) 

where 

Gt := exp ( / max/i~(ir I = exp ( / max —\og HsiSl) dr \ . 

Remark 9. The term g'^I^^^ in (29) accounts for the maximum error growth due to 
importance sampling between the components. If S\ = S'' is independent of t for every 
i, and there is an io € such that ^ log/Xs(S'*) is maximized by 5"*° for all s G (/3o, A); 
then 

Gt = expf r max -flog M.(S^)ds') = ^*^f'°\ \fte[Po,Pi], 

i.e., Gt is the growth rate of this strongest growing component. In general, things 
are more complicated, but a similar interpretation is at least possible on appropriate 
subintervals of [/?o,/?i]- 

Now wc return to the case when Mt is finite. The next corollary tells us how many 
MCMC moves are sufficient to obtain an estimate on the growth of £t that is not much 
worse than (29). 
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Corollary 10. Let (3 € (/Soj/^i] o,nd 5> and assume that 



Mt>^ + afBt VtG(/?o,/3) (30) 



mm{ep, 6) < ep, • 4'^"^^' " exp ^ ds . Vie [/?o, P] ■ (31) 



for some function a : {l3o,P) — > (0, oo). Then 

r^l + S 
In particular, if 

Mt > ^ + iP-Po)-Bt V t G (/?o,/3) (32) 

then 

min(e^,<5) < ep, ■ G^f^^^'^" ■ e'+' . (33) 

Remark 11. The main difference to Corollary 5 is that under local conditions it can 
not be guaranteed that the error remains bounded. Instead, £t can grow with a rate 
dominated by Q^^'^^') . As already pointed out, this is due to importance sampling 
between the components. 

2.4. Example 1: Exponential model with k valleys in the energy landscape. 

This is an extended version of a model considered in [16], [18] as a test case for some 
multi-level MCMC methods. We fix A; G N, and ri, r2, . . . , rjk G N. Let 5° := {0} and 

S' := {{i,j) ■ j = 1,2,. ..,ri}, 1 < i < k. 

We consider the graph with vertex set 

k 

s = \JS' 

1=0 

and edges (0, {i, 1)), 1 <i < k, and {{i,j), {i,j + 1)), I < i < k, 1 < i < n — 1. Suppose 
that 

H{x) = -d{x,0), xeS, 

where d{x, 0) stands for tlic graph distance of x from 0, i.e., H{0) = and H{{i,j)) = —j. 
We assume that fit is given by (1), where fi is an arbitrary probability distribution on 
S such that fj,{x) > for all a; G 5 and jj, is log-concave on each of the valleys S''' of the 
energy landscape, i.e., 

^(log//((i,j + 1)) + log - 1))) < log n{{i,j)) 

for all 1 < i < and 1 < j < r-j. We consider the setup for sequential MCMC as 
described above where Ct is the generator of the Metropolis dynamics w.r.t. /if based 
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on the nearest neighbor random walk on S. Of course, there are more efficient ways 
to carry out Monte Carlo integrations in this special situation. The point, however, is 
that sequential MCMC methods can be applied even though the underlying structure 
of the energy landscape is unknown. Let R = msLX.i<i<kfi- An application of Corollary 
10 with /3o = and SI = for alH > yields the following result : 



Theorem 12. // 



then 

m.m{ef3,S) < e^+^ ■ soG^^^"^^' ■ eo V(5g(0,1). (34) 

Moreover, if the conditional distribution fj,{-\S^°) lies deeper in one of the valleys than 
in the others in the sense that 

■■ 3 > h}\S'') > ^i{{{i,j) : j > h}\S') , (35) 



then 



^ IhsiS'^) 



10 



and thus 



Remark 13. (i) The last estimate indicates that to obtain good bounds it is crucial 
that the mass allocated by the initial distribution on the component with strongest 
importance growth is not too small (although it can be rather small if the initial distri- 
bution vq is a good approximation of /xq). 

(ii) Let K/s = Mf dt. Note that Kp is a measure for the total number of MCMC 
steps that a corresponding sequential MCMC algorithm will perform on average. The 
theorem implies that choosing constant on [0, /3] with of order 0{(3^) is sufficient 
to guarantee that the nonlinear flow of measures has good stability properties on [0,(5], 
and can thus be used to efficiently approximate up. In contrast to this situation, the 
flow of measures corresponding to the standard simulated annealing algorithm has good 
stability properties only if Kp grows exponentially in p. 
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2.5. Example 2: The mean field Ising model. As a very simple example for a model 
with a phase transition, we now consider the mean field Ising (Curie-Weiss) model, i.e. 
/x^ is of type (1) where /xq = is the uniform distribution on the hypercube 

S = {-1,+!}^, 

and 

1 ~ 

for some N £ N. Let be the generator of the (time-continuous) Metropolis chain 
w.r.t. fi0 based on the nearest neighbor random walk on S as proposal matrix. It is well 
known that this chain is rapidly mixing (i.e. the spectral gap decays polynomially in N) 
for /? < 1, but torpid mixing holds (i.e. the spectral gap decays exponentially fast in A'') 
for (3 > 1. Thus in the multi-phase regime /? > 1, the classical Metropolis algorithm 
converges to equilibrium extremely slowly for large N. 

Now assume for simplicity that N is odd, and decompose S into the two components 



S+ := laeS 



S- := <aeS 



N >j 

i=i ) 

N >| 



and 



Improving on previous results (e.g. of Madras and Zheng [18]), Schweizer [21] showed 
recently that the spectral gaps of the restricted Metropolis chains on both and S~ 
are bounded from below by ^N~^ for every t > 0. Applying the results above to the 
error growth for the non-linear semigroup corresponding to sequential MCMC in this 
situation, we obtain : 

Theorem 14. For every /3 > and N EN, 

sup Et < ■ £0 

0<t</3 

holds whenever eo < 1 and 

Mt > ^N^ + IpN^ Vte(0,/3). (38) 

4 8 

Remark 15. (i) The result is based on a rough estimate of At and Bt in terms of the 

local spectral gap. We expect that a more precise estimate of these constants would 
yield a smaller power of A^ in (38). Furthermore, for /? < 1, the result can be improved 
by applying global instead of local spectral gap estimates. However, our main interest 
is the phase transition regime. 
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(ii) Related results for the mean field Ising model have been obtained for mixing 
times of Markov chains for umbrella sampling in [16], and for simulated and parallel 
tempering in [18], [1], [21]. Schweizer [21] obtains an upper bound on the order in N 
and /? of the mixing time (inverse spectral gap) for simulated tempering that is close 
to the one in (38). In contrast, the best known order for parallel tempering is much 
worse. In general, it seems that the analysis of sequential MCMC is partially simpler 
than the one for parallel tempering, where one has to take into account that a particle 
can only move in temperature if another particle moves in the opposite direction. In fact, 
for this reason we would expect that sequential MCMC methods can have substantial 
advantages compared to parallel tempering. 

(iii) The theorem can be extended to a mean field Ising model with magnetic field. In 
this case, however, one has to take into account an additional (but well controlled) error 
growth due to importance sampling/resampling between the components. Moreover, 
the decomposition into the two components will now depend on t. Without magnetic 
field this is not the case because of the built-in symmetry. 

2.6. Extensions. As remarked above, our results immediately extend to the case where 

l_n{x) = ^e-^ot^^W'^^^(x) {0<t<l3) 

for a continuous function {s,x) Us{x) on [0, /?] with Us{x) > for all s G [0,P] and 
X £ S. In this case, the evolution equations (9) and (10) in Theorem 1 take the form 

d 

—vt = MtVtCt - Uti't + T^t{Ut)vt, and 
= MtCtgt + MUt{gt-^)]9t■ 
Th.e evolution equation (11) for the mean square error and all the stability estimates in 
Sections 2.2 and 2.3 still hold if H is replaced by Ut, and, correspondingly, 

Ht = Ut- MUt] . 

All proofs are completely analogous. 

The extension of the results to more general state spaces requires some (standard) 
technical assumptions which make the proofs slightly less transparent. We postpone this 
extension to a future publication where we will also consider corresponding applications. 
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3. Proofs 

3.1. Proof of Theorem 1. To simplify the notation, we assume Mf = 1 for all t > 0. 

The general case is similar with Ct replaced by Mf Ct- Let us also set pt := po,t and 
$t := $o,t- Then one has 

Vt = = TT— . 

The forward equation (7) yields 

^upt = vptCt - Hvpt. 

Since {i'pt){l) > and Ctl = for all t, we obtain 

d d vpt 



dt ' dt {upt){l) 

vptCt - Hupt (i^PtCt - Hiypt){l) vpt 



(39) 



{up,){l) {upt){lf 
= vtCt - Hut + yt{H)vt . (40) 

Next, we derive a corresponding evolution equation for the densities 

Since /x^ has full support and is differentiable in t, we obtain 

d 1 d ut d , 

'^9t = —-^.^t ;^ log /it. (41) 

at jjLt ot jjit at 

Note that by the detailed balance condition (5), the relative density of vt^t w.r.t. jit is 

my) ^ H\y) v 

Hence (40) yields 

1 Q 

— WL^t = ^tgt - Hgt + vt{H) gt 
lit dt 

= {jrt-H)gt +Et[Hgt]gt. (42) 
Recalling that iJ,t = -^e~^^ fj, with Zt = J2s^~^^^^^ f^iy)y one has 

^log//t = -iit{H-Et[H]) = -HtHt, (43) 



hence 



-^^log//t = {H-Et[H])gt. (44) 
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Inserting (42) and (44) into (41) we obtain 

d 
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dt 



gt = jCtgt + {Et[Hgt]-Et[H])gt 
= Ctgt + Et[H{gt-l)\gt. 



(45) 



We are now ready to derive the equation for the quadratic error 



et = Et[{gt-l?]. 



Differentiating this expression with respect to t one gets by (45) and (43), 



et = 2Et 



{-Ql9t){9t - 1) 



d_ 

di 



{9t - '^f iz^og fit 



= 2Et [{Ctgt){gt - 1)] + 2Et [gt{gt - 1)] Et [H{gt - 1)] - [Ht{gt - if] 
= 2Et [Ctigt - 1) {gt - 1)] + 2Et [{gt - if] Et [H{gt - 1)] - E^ [Ht{gt - if] 
= -2St{gt - 1) + 2Et [H{gt - 1)] • - E* [Ht{gt - if] . 

In the derivation we have used that 

Et[gt-l] = i^t{l)-M'i-) = 0, 

and Cfl = 0- The equation imphes (11) in the case Mt = 1. The general case follows 
similarly. □ 



3.2. Proof of Theorem 4. We have to estimate the terms on the right hand side of 

(11). By the assumed iif-Poincare inequality (13), we obtain 

-^Et[Ht{gt-lf] < ^Et[H^{gt-lf] < ■ St{gt - I) . 



Moreover, 



Et[Ht{gt-l)] < {Et[H^]f\Et[{gt-lf]y'' = at{H)e, 



2l\V2 



-1/2 



Substituting into (11) yields 
d 



dt 



et < -2{Mt~ At/2)8t{gt-l) + 2at{H)£ 

2Mt - At 3/2 
< 7^ et + 2at{H)et' , 



3/2 



by the global Poincare inequality (12), provided Mt > At/2. This proves (15). 
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On the other hand, 

Et [Ht{-{gt-lf/2 + {gt-l)et)] 
1 
2 



= l-Kt [H^ {gt - 1)2] + Et [H^{1 - gt)] e* (46) 



+Et [H+{ - {gt - lf/2 + {gt - !)£*)] . 

Estimating the three summands on the right hand side separately yields 

Et[Hi{gt-lf] < Af£t{gt-l) 

by the if-Poincare inequality (13), 

Et[H^{l-gt)\ < Et[Hr]'/'Et[Hf{gt-in/' 
< Et[Hr]'/'Al/'£t{gt - 1)V2 

by the Cauchy-Schwarz inequality and (13), and 

Et[H+{-{gt-lf/2 + {gt-l)et)] < ^E,[if+]£? = ^E,[i7,-]4- 

The last estimate follows since 

^ej > {gt-l)et-^{gt-l)' 

and 

Et[H+] - Et[H^] = Et[Ht] = . 
By combining the estimates, (46) and (11), we obtain 

j^et < -{2Mt - At) St{gt - 1) + 2Ay^Et[Hr]^/%{gt - Xf^'^et + Et\Ht\ 4 ■ 
This combined with the global Poincare inequality (12) yields 

j,et < -^^^e, + 2^E,m^/%^/^+E,m4, 

and hence (16). □ 

3.3. Proof of Corollary 5. If (17) or (18) holds for t G (/3o,/5i), then by Theorem 4 
and continuity, i i-^ £t is strictly decreasing near Pq and near any s £ (/3o,/3i) such that 
£s < £/3o- Hence it is strictly decreasing on the whole interval [j3o,Pi]. □ 
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3.4. Proof of Theorem 8. Similarly to Theorem 4, we have to control the right hand 
side of (11), but now by using only local Poincare type inequalities. Let 

ft := gt-1 and ft := ft-^t[fM ■ 



Then 



Et [Hf{-{gt-lf/2 + {gt-l)et)] 
= Et [Ht {-{gt- 1) V2 + {gt - i)et) 

+ ^ ^,t{Si) {El[H] - Et[H]) . El [-{gt - lf/2 + {gt - l)st] 



= ~Et [Htff] - Et [Htft [ft Itt]] + Et [Htft et] 

+ ^ i,t{Sl) ■ {El[H] - Et[H]) . El [-fi/2 + ftst] (47) 

= -\ Et[Htn] + I^ti^l) ■ El[Htft\ ■ {st - El[ft]) 

+ Y,^^t{Si)ht{i)-E^t[-fI/'^ + ftet\. 
iei 

Here we have used the definitions of Ht, Ht and ht, and the fact that Et[H^t|7r] = 0. 
We now estimate the three summands on the right hand side separately. By the local 
iJ-Poincare inequality (25), 

-lEt[Htff] <lAfSt{ft). 

By (26), and since 

Y^Msimft] = Et[ft] = 0, 

i 

we have 



Y,MSl)-El[Htft]-{et-Ei[ft]) 

iei 



1/2 / N 1/2 

2 I 



< Yl /^*(^*) ^IWt? E MSI) {et - E^tUt]) 
Kiel ) Kiel 

< By%{fty/'.{el + Y^^t{Sl)El[fi]f' 
= (^BtSt{ft)-ef{l + et)y\ 

Moreover, since 

-/2/2 + M < ej/2. 
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we obtain 

^fitiSl)htii)El[-f^/2 + ftet] 
iei 

iei iei 
< Q ^? + ^ ^* + ■ ™ax/i^ = £*•(! + ■ max^j 

Here we have used that 

Y.^^t(^i)htii} = XlMt(5t^)/it"(i) < max/i^, and 



1/2 1/2 

= 



Combining the estimates yields by (11) and (47) : 
~et < -MfStift) +Et[Ht{-ff/2 + ftet)] 

< -(^Mt-^y Stift) + [Bt£t{ft)etil + et)y^^ + ^et{l + ^tfma^h 

B 1 

- o.^ * A gt(l + £t) + i:£t (l + Vg^)VaxV , 
IMt — At ^ 

provided Mt > At/2. This proves (27). 
Moreover, for any subset S, 

|log/xt(A) = ^log^e-*^(-)Mx) - ^logZt 

= -Et[H\A] + Et[H\, 

which proves (28). 

3.5. Proof of Corollary 10. Assume that (30) holds, and let 

ut := et/G\ 

Then by the definition of Gt-, Theorem 8, and (30), 



^ log tit = ^loget - (1 + \/5)^max/ij 
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for all t € {Po,l3) such that et < 6. Hence 

^(i+v^)2 . 1 + 6 ^(i+Vs)^ 
et = UfGy^ ' < £f3o ■ exp / ds-G).^ ' 

holds for t G [Po,(3] provided the right hand side is smaller than 6. This proves (31). 
The second assertion is a straightforward consequence. □ 

3.6. Proof of Theorem 12. The log-concavity of /j, easily implies that /xt as well is 
log-concave on for all t > and I < i < k. In particular, the restriction of nt to 
has a unique local maximum for every i. By the method of canonical paths it is then 
not difficult to prove that the spectral gap of the Metropolis dynamics w.r.t. ^t(-|S'*) 
based on the standard random walk is bounded from below by l/2rf for all t > and 
1 < i < k, ci. e.g. Proposition 6.3 in [12]. Now we are in the setting of Remark 7 (ii), 
according to which (22) and (23) hold with as in (21), 

4 = 2rf, and Bl = . 

Estimate (34) now follows by a straightforward application of Corollary 10. 

To prove the second part of the assertion, we show that (35) places us in the setting 
of Remark 9. In fact, for t > 0, 

^logntiS') = Et[H]-Ei[H] foralH, and 

If (35) holds, then for any t > 0, the right hand side is maximized when i = iq. Hence 
by Remark 9, 



Gt = for all t > 0. 



□ 



3.7. Proof of Theorem 14. Since -N/2 < H{a) < for all a, we have osc (H) < N/2 
and ^ 

yartiH\S+) = yart{H\S-) < Q osc (i/)^ < N^/8 

for every t>0. By Schweizer's result [21], a local Poincare inequality of type (24) holds 
on S+ and S' with = = QN^. Hence by Remark 7 (ii), (22) and (23) hold with 

Af = In^ and Bf = ^ N" . 
* 2 * 8 

The assertion now follows from Corollary 10, since 

E+[H] = K^[H] = Et[H] . 
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□ 
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